ResNEsts and DenseNEsts: Block-based DNN Models with Improved Representation Guarantees
Models recently used in the literature proving residual networks (ResNets) are better than linear predictors are actually different from the standard ResNets that have been widely used in computer vision. Beyond assumptions such as scalar-valued output or a single residual block, the models considered in the literature fundamentally differ in that they have no nonlinearities at the final residual representation that feeds into the final affine layer. To codify this difference in nonlinearities and reveal a linear estimation property, we define ResNEsts, i.e., Residual Nonlinear Estimators, by simply dropping the nonlinearities at the last residual representation from standard ResNets. We show that wide ResNEsts with bottleneck blocks can always guarantee a very desirable training property that standard ResNets aim to achieve, i.e., adding more blocks does not decrease performance given the same set of basis elements. To prove that, we first recognize that ResNEsts are basis function models that are limited by a coupling problem in basis learning and linear prediction.
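The "coupling problem" view treats a ResNEst as a basis function model whose final affine layer is a plain least-squares fit over the learned basis. The following toy sketch is my own illustration (with a naive normal-equations solver, not the paper's code) of why enlarging the set of basis elements can never increase the training objective of that final linear fit:

```python
def lstsq_error(features, targets):
    """Squared training error of the best linear fit over the given basis.
    Solves the normal equations A w = b (A = X^T X, b = X^T y) by naive
    Gaussian elimination with partial pivoting; fine for tiny examples."""
    n = len(features[0])  # number of basis elements
    A = [[sum(f[i] * f[j] for f in features) for j in range(n)] for i in range(n)]
    b = [sum(f[i] * y for f, y in zip(features, targets)) for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            m = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= m * A[col][c]
            b[r] -= m * b[col]
    w = [0.0] * n
    for r in reversed(range(n)):
        w[r] = (b[r] - sum(A[r][c] * w[c] for c in range(r + 1, n))) / A[r][r]
    return sum((sum(wi * fi for wi, fi in zip(w, f)) - y) ** 2
               for f, y in zip(features, targets))

# Toy data: a small basis vs. the same basis plus one more element.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 2.0, 5.0, 10.0]
small = [[x] for x in xs]          # basis {x}
large = [[x, x * x] for x in xs]   # basis {x, x^2}, a superset
err1, err2 = lstsq_error(small, ys), lstsq_error(large, ys)
print(err1, err2)  # the richer basis can only do better on the training set
```

The nested-basis argument is the whole point: the larger model can always recover the smaller one by zeroing the extra weight, so the optimum can only improve.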
Appendices

A Proofs

This part contains the proofs of Lemma 3.1 and Theorem 3.2. We also restate Lemma 3.1 and Theorem 3.2 using the new notations here, so that this part can be read on its own. Under the new notations, Equation (2) in Section 3.2 becomes f(x) = ... By Lemma A.2, the optimal solution of (7) can be found on the vertices of the feasible region. Now we consider the discreteness constraint.

The detailed parameters in the training and pruning stages of our method are listed in Table 4. The main building block of ResNet-50 is the bottleneck block [He et al., 2016], as shown in Figure 5. Many scaling factors of a bottleneck block are already 0 or very close to 0, so we do not apply any extra processing to them. Figure 6 compares the layer-wise distributions of scaling factors between the baseline ResNet-50 model and the model trained with our polarization regularizer on the ImageNet dataset.

Figure 6: Comparison of the layer-wise scaling factor distributions between the baseline ResNet-50 model and the model trained with our polarization regularizer on the ImageNet dataset.
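Since the bottleneck block is the main building block here, a quick sketch of its parameter arithmetic (my own back-of-the-envelope numbers, with biases and BatchNorm parameters ignored) shows why ResNet-50 uses it instead of the basic two-convolution block:

```python
def basic_block_params(channels):
    """Basic ResNet block: two 3x3 convolutions at full width."""
    return 2 * (3 * 3 * channels * channels)

def bottleneck_block_params(channels, reduction=4):
    """Bottleneck block: 1x1 reduce -> 3x3 at reduced width -> 1x1 expand."""
    mid = channels // reduction
    return (1 * 1 * channels * mid   # 1x1 reduce
            + 3 * 3 * mid * mid      # 3x3 conv at reduced width
            + 1 * 1 * mid * channels)  # 1x1 expand

print(basic_block_params(256))       # two 3x3 convs at width 256
print(bottleneck_block_params(256))  # roughly 17x fewer parameters
```

The 1x1 convolutions do the cheap channel bookkeeping so that the expensive 3x3 convolution runs at a quarter of the width.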
Peeking Behind the Curtains of Residual Learning
Zhang, Tunhou, Yan, Feng, Li, Hai, Chen, Yiran
The utilization of residual learning has become widespread in deep and scalable neural nets. However, the fundamental principles that contribute to the success of residual learning remain elusive, thus hindering effective training of plain nets with depth scalability. In this paper, we peek behind the curtains of residual learning by uncovering the "dissipating inputs" phenomenon that leads to convergence failure in plain neural nets: the input is gradually compromised through plain layers due to non-linearities, resulting in challenges of learning feature representations. We theoretically demonstrate how plain neural nets degenerate the input to random noise and emphasize the significance of a residual connection that maintains a better lower bound of surviving neurons as a solution. With our theoretical discoveries, we propose "The Plain Neural Net Hypothesis" (PNNH) that identifies the internal path across non-linear layers as the most critical part in residual learning, and establishes a paradigm to support the training of deep plain neural nets devoid of residual connections. We thoroughly evaluate PNNH-enabled CNN architectures and Transformers on popular vision benchmarks, showing on-par accuracy, up to 0.3% higher training throughput, and 2x better parameter efficiency compared to ResNets and vision Transformers.
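A toy numerical illustration (my own sketch, not the paper's experiment or bound) of the "dissipating inputs" idea: each plain ReLU layer with small random weights shrinks the signal, so after many layers the representation collapses toward zero, while a residual connection keeps the input alive:

```python
import math
import random

def relu_layer(x, rng):
    """One plain layer: random dense weights ~ N(0, 1/dim), then ReLU."""
    dim = len(x)
    return [max(0.0, sum(rng.gauss(0.0, dim ** -0.5) * xi for xi in x))
            for _ in range(dim)]

def norm(x):
    return math.sqrt(sum(v * v for v in x))

DIM, DEPTH = 64, 30
rng = random.Random(0)
x_plain = [1.0] * DIM  # plain stack
x_res = [1.0] * DIM    # same layers, but with a skip connection
for _ in range(DEPTH):
    x_plain = relu_layer(x_plain, rng)
    x_res = [xi + hi for xi, hi in zip(x_res, relu_layer(x_res, rng))]

print(norm(x_plain), norm(x_res))  # plain signal collapses; residual survives
```

With this initialization each plain layer halves the expected squared norm, so thirty layers leave essentially noise-level signal, whereas the residual path is non-decreasing coordinate-wise.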
An Empirical Analysis of the Shift and Scale Parameters in BatchNorm
Batch Normalization (BatchNorm) is a technique that improves the training of deep neural networks, especially Convolutional Neural Networks (CNN). It has been empirically demonstrated that BatchNorm increases performance, stability, and accuracy, although the reasons for such improvements are unclear. BatchNorm includes a normalization step as well as trainable shift and scale parameters. In this paper, we empirically examine the relative contribution to the success of BatchNorm of the normalization step, as compared to the re-parameterization via shifting and scaling. To conduct our experiments, we implement two new optimizers in PyTorch, namely, a version of BatchNorm that we refer to as AffineLayer, which includes the re-parameterization step without normalization, and a version with just the normalization step, that we call BatchNorm-minus. We compare the performance of our AffineLayer and BatchNorm-minus implementations to standard BatchNorm, and we also compare these to the case where no batch normalization is used. We experiment with four ResNet architectures (ResNet18, ResNet34, ResNet50, and ResNet101) over a standard image dataset and multiple batch sizes. Among other findings, we provide empirical evidence that the success of BatchNorm may derive primarily from improved weight initialization.
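The three variants compared in the paper can be sketched per feature on one batch of scalar activations (my own minimal rendering of the described layers, not the paper's PyTorch code): full BatchNorm normalizes and then applies trainable shift/scale, BatchNorm-minus keeps only the normalization step, and AffineLayer keeps only the shift/scale:

```python
import math

def batchnorm_minus(batch, eps=1e-5):
    """Normalization step only: zero mean, unit variance over the batch."""
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    return [(x - mean) / math.sqrt(var + eps) for x in batch]

def affine_layer(batch, gamma=1.0, beta=0.0):
    """Trainable shift/scale only, no normalization."""
    return [gamma * x + beta for x in batch]

def batchnorm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Standard BatchNorm = normalization followed by the affine step."""
    return affine_layer(batchnorm_minus(batch, eps), gamma, beta)

batch = [2.0, 4.0, 6.0, 8.0]
out = batchnorm_minus(batch)
print(out)  # zero mean, (almost) unit variance
```

In a real network gamma and beta are learned per channel and running statistics are tracked for inference; this sketch only separates the two ingredients the paper ablates.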
Implementing ConvNext in PyTorch
Hello There!! Today we are going to implement the famous ConvNext in PyTorch, proposed in A ConvNet for the 2020s. Code is here; an interactive version of this article can be downloaded from here. The paper proposes a new convolution-based architecture that not only surpasses Transformer-based models (such as Swin) but also scales with the amount of data! The following pictures show ConvNext accuracy across different dataset and model sizes. The authors started from the well-known ResNet architecture and iteratively improved it, following new best practices and discoveries made in the last decade.
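One of those iterative changes is swapping dense convolutions for large-kernel depthwise ones. Rough parameter arithmetic (my own numbers for an example stage width of 96, biases ignored; not the blog's code) shows why a 7x7 depthwise convolution plus a 1x1 pointwise layer stays cheaper than an ordinary 3x3 convolution:

```python
def dense_conv_params(k, c_in, c_out):
    """Ordinary k x k convolution mixing all channels."""
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    """k x k depthwise (one filter per channel) + 1x1 pointwise mixing."""
    return k * k * c_in + c_in * c_out

print(dense_conv_params(3, 96, 96))           # dense 3x3
print(depthwise_separable_params(7, 96, 96))  # 7x7 depthwise + 1x1, far fewer
```

That headroom is what lets ConvNext afford the big 7x7 receptive field per block.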
A New Measure of Model Redundancy for Compressed Convolutional Neural Networks
Huang, Feiqing, Si, Yuefeng, Zheng, Yao, Li, Guodong
While recently many designs have been proposed to improve the model efficiency of convolutional neural networks (CNNs) on a fixed resource budget, theoretical understanding of these designs is still conspicuously lacking. This paper aims to provide a new framework for answering the question: Is there still any remaining model redundancy in a compressed CNN? We begin by developing a general statistical formulation of CNNs and compressed CNNs via the tensor decomposition, such that the weights across layers can be summarized into a single tensor. Then, through a rigorous sample complexity analysis, we reveal an important discrepancy between the derived sample complexity and the naive parameter counting, which serves as a direct indicator of the model redundancy. Motivated by this finding, we introduce a new model redundancy measure for compressed CNNs, called the $K/R$ ratio, which further allows for nonlinear activations. The usefulness of this new measure is supported by ablation studies on popular block designs and datasets.
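To make the "weights summarized into a single tensor" framing concrete, here is a small illustration (my own, of generic low-rank compression; it is not the paper's $K/R$ measure itself) of why naive parameter counting is a crude proxy: a rank-R CP decomposition of a d1 x d2 x d3 weight tensor stores R * (d1 + d2 + d3) numbers in place of d1 * d2 * d3:

```python
def full_tensor_params(dims):
    """Entries in the dense weight tensor."""
    out = 1
    for d in dims:
        out *= d
    return out

def cp_params(dims, rank):
    """Entries in a rank-`rank` CP (sum of outer products) factorization."""
    return rank * sum(dims)

dims = (64, 64, 9)  # e.g. C_out x C_in x (3*3 spatial) conv kernel
print(full_tensor_params(dims))
print(cp_params(dims, 8))  # rank-8 factorization, ~34x fewer numbers
```

The paper's point is that two compressed models with the same count can have very different effective capacity, which is what a sample-complexity-based measure is meant to capture.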
Video Classification with Channel-Separated Convolutional Networks
Tran, Du, Wang, Heng, Torresani, Lorenzo, Feiszli, Matt
Group convolution has been shown to offer great computational savings in various 2D convolutional architectures for image classification. It is natural to ask: 1) if group convolution can help to alleviate the high computational cost of video classification networks; 2) what factors matter the most in 3D group convolutional networks; and 3) what are good computation/accuracy trade-offs with 3D group convolutional networks. This paper studies different effects of group convolution in 3D convolutional networks for video classification. We empirically demonstrate that the amount of channel interactions plays an important role in the accuracy of group convolutional networks. Our experiments suggest two main findings. First, it is a good practice to factorize 3D convolutions by separating channel interactions and spatiotemporal interactions as this leads to improved accuracy and lower computational cost. Second, 3D channel-separated convolutions provide a form of regularization, yielding lower training accuracy but higher test accuracy compared to 3D convolutions. These two empirical findings lead us to design an architecture -- Channel-Separated Convolutional Network (CSN) -- which is simple, efficient, yet accurate. On Kinetics and Sports1M, our CSNs significantly outperform state-of-the-art models while being 11-times more efficient.
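Back-of-the-envelope parameter counts (my own sketch, not the paper's tables) show the trade-off the abstract describes: a k x k x k group convolution with g groups cuts parameters by a factor of g, and the channel-separated extreme (depthwise spatiotemporal filtering plus 1x1x1 pointwise channel mixing) confines channel interactions to the cheap pointwise layer:

```python
def group_conv3d_params(k, c_in, c_out, groups):
    """3D group convolution: each filter sees only c_in/groups channels."""
    return k ** 3 * (c_in // groups) * c_out

def channel_separated_params(k, channels):
    """Depthwise k x k x k conv + 1x1x1 pointwise conv (CSN-style split)."""
    return k ** 3 * channels + channels * channels

c = 64
print(group_conv3d_params(3, c, c, 1))  # ordinary 3x3x3 convolution
print(group_conv3d_params(3, c, c, 8))  # 8 groups: 8x fewer parameters
print(channel_separated_params(3, c))   # fully separated: cheapest
```

How much channel interaction to keep is exactly the knob the paper tunes for the accuracy/computation trade-off.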
Hierarchical binary CNNs for landmark localization with limited resources
Bulat, Adrian, Tzimiropoulos, Georgios
Our goal is to design architectures that retain the groundbreaking performance of Convolutional Neural Networks (CNNs) for landmark localization and at the same time are lightweight, compact and suitable for applications with limited computational resources. To this end, we make the following contributions: (a) we are the first to study the effect of neural network binarization on localization tasks, namely human pose estimation and face alignment. We exhaustively evaluate various design choices, identify performance bottlenecks, and more importantly propose multiple orthogonal ways to boost performance. (b) Based on our analysis, we propose a novel hierarchical, parallel and multi-scale residual architecture that yields large performance improvement over the standard bottleneck block while having the same number of parameters, thus bridging the gap between the original network and its binarized counterpart. (c) We perform a large number of ablation studies that shed light on the properties and the performance of the proposed block. (d) We present results for experiments on the most challenging datasets for human pose estimation and face alignment, reporting in many cases state-of-the-art performance. (e) We further provide additional results for the problem of facial part segmentation. Code can be downloaded from https://www.adrianbulat.com/binary-cnn-landmark
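For context, a minimal sketch of the weight binarization such networks build on (the XNOR-Net-style scheme W ≈ alpha * sign(W) with alpha = mean |W|; the paper's contribution is the hierarchical block around this step, not the step itself):

```python
def binarize(weights):
    """Approximate real weights by a single scale times their signs."""
    alpha = sum(abs(w) for w in weights) / len(weights)
    return [alpha if w >= 0 else -alpha for w in weights]

w = [0.5, -1.5, 0.25, -0.75]
print(binarize(w))  # each weight replaced by +/- mean(|w|) = +/- 0.75
```

Storing one scale plus one sign bit per weight is what yields the large memory and compute savings on limited hardware.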